Philosophy Dictionary of Arguments


 
Statistical learning: Statistical learning involves using algorithms and models to analyze data and make predictions from it, on the assumption that the available data are a sample from some larger population. It encompasses techniques such as regression, classification, and clustering, using statistical principles to extract meaningful patterns and relationships from datasets. See also Machine Learning, Data, Classification, Prediction, Generalization, Induction, Learning, Artificial Intelligence, Algorithms, Models.
_____________
Annotation: The above characterizations of concepts are neither definitions nor exhaustive presentations of the problems related to them. Instead, they are intended to give a short introduction to the contributions below. – Lexicon of Arguments.

 

Peter Norvig on Statistical Learning - Dictionary of Arguments

Norvig I 825
Statistical learning/Norvig/Russell: Statistical learning methods range from simple calculation of averages to the construction of complex models such as Bayesian networks. They have applications throughout computer science, engineering, computational biology, neuroscience, psychology, and physics. ((s) Cf. >Prior knowledge/Norvig).
Bayesian learning methods: formulate learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. This approach provides a good way to implement Ockham’s razor, but quickly becomes intractable for complex hypothesis spaces.
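((s) A minimal Python sketch of this idea, assuming a toy, invented hypothesis space: each hypothesis asserts a different probability of "heads", and the prior and observations below are made up for illustration only:)

```python
# Bayesian learning as probabilistic inference over a small hypothesis space.
# Hypotheses, prior, and observations are invented for illustration.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]   # P(heads) asserted by each hypothesis
prior      = [0.1, 0.2, 0.4, 0.2, 0.1]   # assumed prior over hypotheses

observations = ["heads", "heads", "tails", "heads"]  # assumed sample data

def likelihood(h, obs):
    """P(obs | h) for a single observation."""
    return h if obs == "heads" else 1.0 - h

# Bayesian update: posterior(h) is proportional to prior(h) * P(data | h).
posterior = list(prior)
for obs in observations:
    posterior = [p * likelihood(h, obs) for p, h in zip(posterior, hypotheses)]
    z = sum(posterior)
    posterior = [p / z for p in posterior]  # normalize

# Bayesian prediction averages over all hypotheses, weighted by the posterior.
p_next_heads = sum(p * h for p, h in zip(posterior, hypotheses))
print(posterior, p_next_heads)
```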
Maximum a posteriori (MAP) learning: selects a single most likely hypothesis given the data. The hypothesis prior is still used and the method is often more tractable than full Bayesian learning.
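((s) A short sketch of the contrast with full Bayesian prediction, again with invented numbers: MAP learning commits to the single most probable hypothesis instead of averaging:)

```python
# MAP learning over a toy hypothesis space; the posterior values are assumed.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]       # P(heads) under each hypothesis
posterior  = [0.02, 0.10, 0.35, 0.38, 0.15]  # assumed posterior P(h | data)

h_map = max(zip(posterior, hypotheses))[1]   # hypothesis with highest posterior
p_next_heads = h_map                         # prediction uses the MAP hypothesis alone
print(h_map, p_next_heads)
```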
Maximum-likelihood learning: simply selects the hypothesis that maximizes the likelihood of the data; it is equivalent to MAP learning with a uniform prior. In simple cases such as linear regression and fully observable Bayesian networks, maximum-likelihood solutions can be found easily in closed form. Naive Bayes learning is a particularly effective technique that scales well.
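((s) A minimal sketch of maximum-likelihood naive Bayes learning on a toy, invented dataset: class priors and per-feature conditionals are estimated in closed form as relative frequencies; feature names and values are hypothetical:)

```python
# Maximum-likelihood naive Bayes from counts; all data are invented.
from collections import Counter, defaultdict

data = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "sunny", "windy": "no"},  "play"),
]

class_counts = Counter(c for _, c in data)
feature_counts = defaultdict(Counter)            # (class, feature) -> value counts
for features, c in data:
    for f, v in features.items():
        feature_counts[(c, f)][v] += 1

def predict(features):
    """Return the class maximizing P(class) * product over features of P(value | class)."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(data)                       # ML estimate of the class prior
        for f, v in features.items():
            p *= feature_counts[(c, f)][v] / n_c  # ML estimate of P(v | c)
        scores[c] = p
    return max(scores, key=scores.get)

print(predict({"outlook": "sunny", "windy": "no"}))
```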
Hidden variables/latent variables: When some variables are hidden, local maximum likelihood solutions can be found using the EM algorithm. Applications include clustering using mixtures of Gaussians, learning Bayesian networks, and learning hidden Markov models.
Norvig I 823
EM algorithm: Each of these applications involves computing expected values of the hidden variables for each example and then recomputing the parameters, using the expected values as if they were observed values.
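((s) A minimal Python sketch of EM for a one-dimensional mixture of two Gaussians; the data, initial parameters, and number of iterations are invented for illustration:)

```python
# EM for a two-component 1-D Gaussian mixture; data and initial guesses are assumed.
import math

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]                    # assumed sample
mu, sigma, weight = [1.5, 4.5], [1.0, 1.0], [0.5, 0.5]   # initial parameter guesses

def gaussian(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

for _ in range(20):
    # E-step: responsibilities r[i][k] = P(component k | x_i, current parameters)
    r = []
    for x in data:
        p = [weight[k] * gaussian(x, mu[k], sigma[k]) for k in range(2)]
        z = sum(p)
        r.append([pk / z for pk in p])
    # M-step: maximum-likelihood re-estimation, treating expected counts as observed
    for k in range(2):
        n_k = sum(r[i][k] for i in range(len(data)))
        mu[k] = sum(r[i][k] * data[i] for i in range(len(data))) / n_k
        sigma[k] = math.sqrt(
            sum(r[i][k] * (data[i] - mu[k]) ** 2 for i in range(len(data))) / n_k
        ) or 1e-6
        weight[k] = n_k / len(data)

print(mu, sigma, weight)
```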
Norvig I 825
Learning the structure of Bayesian networks is an example of model selection. This usually involves a discrete search in the space of structures. Some method is required for trading off model complexity against degree of fit.
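((s) A minimal sketch of one common way to trade off complexity against fit, a BIC-style score: log-likelihood minus a penalty of (number of parameters / 2) times log(N). The candidate "structures", their log-likelihoods, and parameter counts below are placeholders; a real structure search would compute them from data:)

```python
# BIC-style model selection over hypothetical candidate structures.
import math

N = 1000                                   # assumed number of training examples
candidates = [
    {"name": "sparse net", "loglik": -4210.0, "num_params": 12},
    {"name": "medium net", "loglik": -4105.0, "num_params": 40},
    {"name": "dense net",  "loglik": -4098.0, "num_params": 160},
]

def bic_score(model):
    # Higher is better: fit rewarded, complexity penalized.
    return model["loglik"] - 0.5 * model["num_params"] * math.log(N)

best = max(candidates, key=bic_score)
for m in candidates:
    print(m["name"], round(bic_score(m), 1))
print("selected:", best["name"])
```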
Nonparametric models: represent a distribution using the collection of data points. Thus, the number of parameters grows with the training set. Nearest-neighbors methods look at the examples nearest to the point in question, whereas kernel methods form a distance-weighted combination of all the examples.
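((s) A minimal sketch contrasting the two nonparametric ideas on a toy one-dimensional regression problem; the training points, k, and bandwidth are invented for illustration:)

```python
# k-nearest-neighbours averages the k closest training targets;
# a kernel method forms a distance-weighted combination of all targets.
import math

train = [(0.0, 1.0), (1.0, 1.5), (2.0, 2.2), (3.0, 2.9), (4.0, 4.1)]  # (x, y) pairs

def knn_predict(x, k=3):
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def kernel_predict(x, bandwidth=1.0):
    weights = [math.exp(-((xi - x) / bandwidth) ** 2) for xi, _ in train]
    return sum(w * y for w, (_, y) in zip(weights, train)) / sum(weights)

print(knn_predict(2.5), kernel_predict(2.5))
```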
History: The application of statistical learning techniques in AI was an active area of research in the early years (see Duda and Hart, 1973)(1) but became separated from mainstream AI as the latter field concentrated on symbolic methods. A resurgence of interest occurred shortly after the introduction of Bayesian network models in the late 1980s; at roughly the same time,
Norvig I 826
a statistical view of neural network learning began to emerge. In the late 1990s, there was a noticeable convergence of interests in machine learning, statistics, and neural networks, centered on methods for creating large probabilistic models from data.
Naïve Bayes model: is one of the oldest and simplest forms of Bayesian network, dating back to the 1950s. Its surprising success is partially explained by Domingos and Pazzani (1997)(2). A boosted form of naive Bayes learning won the first KDD Cup data mining competition (Elkan, 1997)(3). Heckerman (1998)(4) gives an excellent introduction to the general problem of Bayes net learning. Bayesian parameter learning with Dirichlet priors for Bayesian networks was discussed by Spiegelhalter et al. (1993)(5). The BUGS software package (Gilks et al., 1994)(6) incorporates many of these ideas and provides a very powerful tool for formulating and learning complex probability models. The first algorithms for learning Bayes net structures used conditional independence tests (Pearl, 1988(7); Pearl and Verma, 1991(8)). Spirtes et al. (1993)(9) developed a comprehensive approach embodied in the TETRAD package for Bayes net learning. Algorithmic improvements since then led to a clear victory in the 2001 KDD Cup data mining competition for a Bayes net learning method (Cheng et al., 2002)(10). (The specific task here was a bioinformatics problem with 139,351 features!) A structure-learning approach based on maximizing likelihood was developed by Cooper and Herskovits (1992)(11) and improved by Heckerman et al. (1994)(12).
Several algorithmic advances since that time have led to quite respectable performance in the complete-data case (Moore and Wong, 2003(13); Teyssier and Koller, 2005(14)). One important component is an efficient data structure, the AD-tree, for caching counts over all possible combinations of variables and values (Moore and Lee, 1997)(15). Friedman and Goldszmidt (1996)(16) pointed out the influence of the representation of local conditional distributions on the learned structure.
Hidden variables/missing data: The general problem of learning probability models with hidden variables and missing data was addressed by Hartley (1958)(17), who described the general idea of what was later called EM and gave several examples. Further impetus came from the Baum–Welch algorithm for HMM learning (Baum and Petrie, 1966)(18), which is a special case of EM. The paper by Dempster, Laird, and Rubin (1977)(19), which presented the EM algorithm in general form and analyzed its convergence, is one of the most cited papers in both computer science and statistics. (Dempster himself views EM as a schema rather than an algorithm, since a good deal of mathematical work may be required before it can be applied to a new family of distributions.) McLachlan and Krishnan (1997)(20) devote an entire book to the algorithm and its properties. The specific problem of learning mixture models, including mixtures of Gaussians, is covered by Titterington et al. (1985)(21). Within AI, the first successful system that used EM for mixture modeling was AUTOCLASS (Cheeseman et al., 1988(22); Cheeseman and Stutz, 1996(23)). AUTOCLASS has been applied to a number of real-world scientific classification tasks, including the discovery of new types of stars from spectral data (Goebel et al., 1989)(24) and new classes of proteins and introns in DNA/protein sequence databases (Hunter and States, 1992)(25).
Maximum-likelihood parameter learning: For maximum-likelihood parameter learning in Bayes nets with hidden variables, EM and gradient-based methods were introduced around the same time by Lauritzen (1995)(26), Russell et al. (1995)(27), and Binder et al. (1997a)(28). The structural EM algorithm was developed by Friedman (1998)(29) and applied to maximum-likelihood learning of Bayes net structures with
Norvig I 827
latent variables. Friedman and Koller (2003)(30) describe Bayesian structure learning.
Causality/causal network: The ability to learn the structure of Bayesian networks is closely connected to the issue of recovering causal information from data. That is, is it possible to learn Bayes nets in such a way that the recovered network structure indicates real causal influences? For many years, statisticians avoided this question, believing that observational data (as opposed to data generated from experimental trials) could yield only correlational information—after all, any two variables that appear related might in fact be influenced by a third, unknown causal factor rather than influencing each other directly. Pearl (2000)(31) has presented convincing arguments to the contrary, showing that there are in fact many cases where causality can be ascertained and developing the causal network formalism to express causes and the effects of intervention as well as ordinary conditional probabilities.
Literature on statistical learning and pattern recognition: Good texts on Bayesian statistics include those by DeGroot (1970)(32), Berger (1985)(33), and Gelman et al. (1995)(34). Bishop (2007)(35) and Hastie et al. (2009)(36) provide an excellent introduction to statistical machine learning.
For pattern classification, the classic text for many years has been Duda and Hart (1973)(1), now updated (Duda et al., 2001)(37). The annual NIPS (Neural Information Processing Systems) conference, whose proceedings are published as the series Advances in Neural Information Processing Systems, is now dominated by Bayesian papers. Papers on learning Bayesian networks also appear in the Uncertainty in AI and Machine Learning conferences and in several statistics conferences. Journals specific to neural networks include Neural Computation, Neural Networks, and the IEEE Transactions on Neural Networks.


1. Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis. Wiley.
2. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.
3. Elkan, C. (1997). Boosting and naive Bayesian learning. Tech. rep., Department of Computer Science and Engineering, University of California, San Diego.
4. Heckerman, D. (1998). A tutorial on learning with Bayesian networks. In Jordan, M. I. (Ed.), Learning in graphical models. Kluwer.
5. Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219–282.
6. Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex Bayesian modelling. The Statistician, 43, 169–178.
7. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
8. Pearl, J. and Verma, T. (1991). A theory of inferred causation. In KR-91, pp. 441–452.
9. Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, prediction, and search. Springer-Verlag.
10. Cheng, J., Greiner, R., Kelly, J., Bell, D. A., and Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. AIJ, 137, 43–90.
11. Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
12. Heckerman, D., Geiger, D., and Chickering, D. M. (1994). Learning Bayesian networks: The combination of knowledge and statistical data. Technical report MSR-TR-94-09, Microsoft Research.
13. Moore, A. and Wong, W.-K. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In ICML-03.
14. Teyssier, M. and Koller, D. (2005). Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In UAI-05, pp. 584–590.
15. Moore, A. W. and Lee, M. S. (1997). Cached sufficient statistics for efficient machine learning with large datasets. JAIR, 8, 67–91.
16. Friedman, N. and Goldszmidt, M. (1996). Learning Bayesian networks with local structure. In UAI-96, pp. 252–262.
17. Hartley, H. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14, 174–194.
18. Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 41.
19. Dempster, A. P., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, 39 (Series B), 1–38.
20. McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley.
21. Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley.
22. Cheeseman, P., Self, M., Kelly, J., and Stutz, J. (1988). Bayesian classification. In AAAI-88, Vol. 2, pp. 607–611.
23. Cheeseman, P. and Stutz, J. (1996). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press.
24. Goebel, J., Volk, K., Walker, H., and Gerbault, F. (1989). Automatic classification of spectra from the infrared astronomical satellite (IRAS). Astronomy and Astrophysics, 222, L5–L8.
25. Hunter, L. and States, D. J. (1992). Bayesian classification of protein structure. IEEE Expert, 7(4), 67–75.
26. Lauritzen, S. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 191–201.
27. Russell, S. J., Binder, J., Koller, D., and Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In IJCAI-95, pp. 1146–1152.
28. Binder, J., Koller, D., Russell, S. J., and Kanazawa, K. (1997a). Adaptive probabilistic networks with hidden variables. Machine Learning, 29, 213–244.
29. Friedman, N. (1998). The Bayesian structural EM algorithm. In UAI-98.
30. Friedman, N. and Koller, D. (2003). Being Bayesian about Bayesian network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50, 95–125.
31. Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
32. DeGroot, M. H. (1970). Optimal Statistical Decisions. McGraw-Hill.
33. Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer Verlag.
34. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. (1995). Bayesian Data Analysis. Chapman & Hall.
35. Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer-Verlag.
36. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction (2nd edition). Springer-Verlag.
37. Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification (2nd edition). Wiley.


_____________
Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are listed below. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments
The notes [Concept/Author], [Author1]Vs[Author2] or [Author]Vs[term], as well as "problem:"/"solution:", "old:"/"new:" and "thesis:", are additions by the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.

Norvig I
Peter Norvig
Stuart J. Russell
Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ 2010

